REPORT  DOCUMENTATION  PAGE 


Form  Approved  OMB  NO.  0704-0188 


The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments
regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington
Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA, 22202-4302.
Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection
of information if it does not display a currently valid OMB control number.

PLEASE  DO  NOT  RETURN  YOUR  FORM  TO  THE  ABOVE  ADDRESS. 


2.  REPORT  TYPE 
Final  Report 


1.  REPORT  DATE  (DD-MM-YYYY) 
23-11-2015 


4.  TITLE  AND  SUBTITLE 

Final  Report:  Scalable  Biomarker  Discovery  for  Diverse  High- 
Dimensional  Phenotypes 


3.  DATES  COVERED  (From  -  To) 
1-Oct-2011 - 31-Jul-2015


5a.  CONTRACT  NUMBER 
W911NF-11-1-0429


5b.  GRANT  NUMBER 


6.  AUTHORS 
Curtis  Huttenhower 


5c.  PROGRAM  ELEMENT  NUMBER 
611102 


5d.  PROJECT  NUMBER 


5e.  TASK  NUMBER 


5f.  WORK  UNIT  NUMBER 


7.  PERFORMING  ORGANIZATION  NAMES  AND  ADDRESSES 

Harvard  School  of  Public  Health 
Biostatistics 

President  and  Fellows  of  Harvard  College 
Boston,  MA  02115  -6028 


9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS 
(ES) 

U.S.  Army  Research  Office 
P.O.Box  12211 

Research  Triangle  Park,  NC  27709-2211 


8.  PERFORMING  ORGANIZATION  REPORT 
NUMBER 


10.  SPONSOR/MONITOR'S  ACRONYM(S) 
ARO 


11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

60119-MA.16 


12.  DISTRIBUTION  AVAILIBILITY  STATEMENT 
Approved  for  Public  Release;  Distribution  Unlimited 


13.  SUPPLEMENTARY  NOTES 

The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department
of  the  Army  position,  policy  or  decision,  unless  so  designated  by  other  documentation. 


14.  ABSTRACT 

Historically,  most  biological  and  medical  investigations  have  examined  a  few  discrete  outcomes  of  interest,  and 
only  a  few  controllable  parameters  were  modified  to  perturb  or  improve  these  outcomes.  Investigations  of  this  form 
went  hand-in-hand  with  the  development  of  inferential  statistics,  which  provide  the  quantitative  tools  to  detect 
which  perturbations  successfully  improve  outcomes.  Biological  and  clinical  research  has  entered  a  realm  of 
modifying  hundreds  or  thousands  of  experimental  parameters  in  high  throughput,  however,  and  high-dimensional 



15.  SUBJECT  TERMS 

high-dimensional  data,  association  testing,  multiple  input  multiple  output,  statistical  methods 


16. SECURITY CLASSIFICATION OF:
a. REPORT: UU
b. ABSTRACT: UU
c. THIS PAGE: UU

17. LIMITATION OF ABSTRACT: UU

18. NUMBER OF PAGES

19a. NAME OF RESPONSIBLE PERSON
Curtis Huttenhower


19b.  TELEPHONE  NUMBER 
617-432-4912 


Standard  Form  298  (Rev  8/98) 
Prescribed by ANSI Std. Z39.18


Report  Title 

Final  Report:  Scalable  Biomarker  Discovery  for  Diverse  High-Dimensional  Phenotypes 

ABSTRACT 

Historically,  most  biological  and  medical  investigations  have  examined  a  few  discrete  outcomes  of  interest,  and  only  a  few  controllable 
parameters  were  modified  to  perturb  or  improve  these  outcomes.  Investigations  of  this  form  went  hand-in-hand  with  the  development  of 
inferential  statistics,  which  provide  the  quantitative  tools  to  detect  which  perturbations  successfully  improve  outcomes.  Biological  and 
clinical research has entered a realm of modifying hundreds or thousands of experimental parameters in high throughput, however, and high-dimensional
statistics have been developed to understand which of these modifications in turn significantly improve outcome. The field has
now reached a point where hundreds or thousands of outcomes can be simultaneously measured as well, but few statistical tools exist to
answer the question, "When many experimental or patient outcomes are measured simultaneously, and many experimental parameters or
treatments are modified, which modifications are significantly associated with improved outcomes?" This project thus aims to develop novel
statistical  methods  for  efficiently  associating  many  controllable  predictor  variables  with  many  observed  response  variables  with  high 
sensitivity  and  specificity. 

HAllA's implementation as a Python package (see http://huttenhower.sph.harvard.edu/halla) has been carried out in
collaboration  with  Weingart  Informatics,  an  independent  software  development  contractor  in  San  Francisco.  This  has  allowed 
academic  development  and  validation  of  the  algorithm  to  be  carried  out  efficiently  by  students  and  postdoctoral  fellows,  while 
Dr.  Weingart  has  provided  industry-quality  code,  unit  testing,  packaging,  documentation,  and  distribution.  He  has  begun 
expanding  this  implementation  to  a  generalizable  Python  platform  for  scientific  workflow  execution,  AnADAMA,  which  may  be  a 
target  for  future  industry  partnership  in  the  lab. 


60119-MA:  Scalable  biomarker  discovery  for  diverse  high-dimensional  phenotypes 

Associate  Professor  Curtis  Huttenhower,  Department  of  Biostatistics,  Harvard  School  of  Public  Health 

Our  final  technical  report  for  this  project  includes  completion  of  all  methodological  development 
for HAllA (the Hierarchical All-against-All input/output association testing approach), in
addition  to  work  previously  completed  for  the  MaAsLin  (Multivariate  Analysis  with  Linear 
models)  system  for  high-dimensional  biomarker  discovery  in  compositional  data.  Both  are 
available  as  open-source  software  packages  with  documentation,  demonstration  data,  and 
tutorials at http://huttenhower.sph.harvard.edu/halla and
http://huttenhower.sph.harvard.edu/maaslin, respectively. Our final publication list includes six
manuscripts  (PMIDs  22699609,  22699610,  24629344,  23949665,  23013615,  and  25732063)  and 
two  currently  in  review  (HAllA  and  its  application  to  the  oral  microbiome). 


Problem  Statement 

Historically,  most  biological  and  medical  investigations  have  examined  a  few  discrete  outcomes 
of  interest,  and  only  a  few  controllable  parameters  were  modified  to  perturb  or  improve  these 
outcomes.  Investigations  of  this  form  went  hand-in-hand  with  the  development  of  inferential 
statistics,  which  provide  the  quantitative  tools  to  detect  which  perturbations  successfully  improve 
outcomes.  Biological  and  clinical  research  has  entered  a  realm  of  modifying  hundreds  or 
thousands  of  experimental  parameters  in  high  throughput,  however,  and  high-dimensional 
statistics  have  been  developed  to  understand  which  of  these  modifications  in  turn  significantly 
improve  outcome.  The  field  has  now  reached  a  point  where  hundreds  or  thousands  of  outcomes 
can  be  simultaneously  measured  as  well,  but  few  statistical  tools  exist  to  answer  the  question, 
"When  many  experimental  or  patient  outcomes  are  measured  simultaneously,  and  many 
experimental  parameters  or  treatments  are  modified,  which  modifications  are  significantly 
associated  with  improved  outcomes?"  This  project  thus  aims  to  develop  novel  statistical  methods 
for efficiently associating many controllable predictor variables with many observed response
variables with high sensitivity and specificity (Fig. 1).

Figure 1: Overview of HAllA. A) Two or more input datasets are represented in matrix form as features (rows) and
samples (columns). B) Continuous data are (optionally) discretized to provide a unified representation of potentially
heterogeneous feature types. C) Features within each dataset are hierarchically clustered by single linkage, using
normalized mutual information as the default, fully generalizable similarity metric. D) A hypothesis tree is built by
coupling clusters between two datasets at equivalent relative depths. Each hypothesis node compares two
clusters, with all pairs of children of the two clusters forming the next level of hypothesis testing. E) Hypothesis
testing is performed by first selecting each cluster's medoid as a representative summary (optionally, multiple
correspondence analysis or principal component analysis can be used instead); a permutation test is then used to determine
which pairs are significantly associated between the two datasets. F) Significant associations are reported after false
discovery rate control within each hypothesis set (family, level, or all).

Results  Summary 

Generalized  multiple  input/output  association  testing:  the  Hierarchical  All-against-All  approach 

HAllA  (Hierarchical  All-against-All  association  testing)  is  a  novel  statistical  method  for  well- 
powered  association  discovery  in  high-dimensional  heterogeneous  datasets,  which  we  have 
developed,  implemented,  validated,  and  applied  to  diverse  datasets.  It  combines  hierarchical 
hypothesis  testing  with  false  discovery  rate  correction  over  highly  generalizable  association 
measures,  yielding  high-powered  discovery  of  linear  and  non-linear  patterns  in  categorical  or 
continuous  high-dimensional  data.  Data  and  metadata  to  be  associated  are  hierarchically 
clustered  for  dimensionality  reduction,  and  nonparametric  permutation  testing  identifies 
relationships  between  the  resulting  blocks  of  correlated  features. 
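
The association measure and permutation test described above can be illustrated with a minimal sketch. This is not the HAllA implementation: the function names (`nmi`, `permutation_pvalue`), the geometric-mean normalization, the add-one p-value correction, and the toy vectors are all assumptions made for illustration, and the inputs are assumed to be already discretized.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(x, y):
    """Normalized mutual information, I(X;Y) / sqrt(H(X) * H(Y)).

    The geometric-mean normalization is one common choice (an assumption here).
    """
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a constant feature carries no information
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return mutual_info / math.sqrt(hx * hy)


def permutation_pvalue(x, y, n_perm=999, seed=0):
    """One-sided permutation p-value for the observed NMI between x and y."""
    rng = random.Random(seed)
    observed = nmi(x, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between x and y
        if nmi(x, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Toy discretized features: y is a relabelling of x, so they are fully dependent.
x = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2] * 2
y = [(v + 1) % 3 for v in x]
print(nmi(x, y), permutation_pvalue(x, y))
```

Because NMI depends only on the joint distribution of labels, the same function applies to categorical features and to discretized continuous features, which is the property the text relies on for heterogeneous data.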

HAllA  was  validated  and  optimized  with  synthetic  data,  outperforming  exhaustive  all-against-all 
association  testing  and  alternative  similarity  measures  such  as  the  Maximum  Information 
Coefficient and Spearman correlation in Types I and II error and in runtime (Fig. 2). The
recommended  HAllA  algorithm  first  groups  features  by  single-linkage  agglomerative 
hierarchical  clustering  using  normalized  mutual  information  (NMI),  then  compares  blocks  of 
features  by  descending  each  tree  logarithmically.  Outliers  are  removed  by  comparing  the  relative 
dissimilarities  within  and  between  datasets,  after  which  cluster  medoids  are  compared  (also  using 
NMI)  and  significant  associations  determined  by  permutation  testing.  False  discovery  rates  are 
globally  controlled  while  maintaining  power  by  correcting  within  each  tree  level.  Each  choice  in 
the  algorithm  is  modularized  and  configurable,  however,  allowing  users  to  select  other  measures 
more  appropriate  for  homogeneous  data  (e.g.  continuously  valued  Spearman  correlations)  or  to 
use  other  block  summarization  techniques  (e.g.  MCA  or  PCA).  The  software  implementation  and 
documentation  are  available  at  http://huttenhower.sph.harvard.edu/halla. 
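
Correcting within each tree level can be sketched as follows. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up rule used here is an assumption, as are the function names, the threshold `q`, and the example p-values.

```python
def benjamini_hochberg(pvalues, q=0.1):
    """Return booleans marking which p-values are rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def correct_by_level(pvalues_by_level, q=0.1):
    """Apply the correction separately to each level of the hypothesis tree."""
    return {
        level: benjamini_hochberg(pvals, q)
        for level, pvals in pvalues_by_level.items()
    }


# Hypothetical p-values: one test at the root, four at the next level.
levels = {
    0: [0.001],
    1: [0.004, 0.03, 0.6, 0.9],
}
print(correct_by_level(levels, q=0.1))
```

Each level is treated as its own family of hypotheses, so a deep level with many tests does not dilute the threshold applied to the few tests near the root.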

Figure 2: HAllA efficiently maintains
Type I and II error relative to
alternative approaches. We simulated
30 datasets with features associated by
linear, parabolic, and sinusoidal
relationships. HAllA outperformed
exhaustive all-against-all (AllA) testing
using multiple measures: the maximum
information coefficient (MIC),
normalized mutual information (NMI)
alone, and alternative summarizations
including MCA or PCA.


We  created  the  Python  package  STRUDEL  (Stratified  Rudimentary  Data  Exploration)  to 
produce  synthetic  data  with  defined  correlation  structure  within  datasets  and  associations 
between blocks of variables among datasets. Each evaluation dataset contained correlated
features, and each paired dataset contained correlations between blocks of features with both
linear and non-linear patterns. Using these simulated data, we evaluated the performance of
HAllA and naive all-against-all (AllA) methods using NMI, MIC, distance correlation (dcor),
and Spearman correlation as similarity metrics. Simple correlations identify only linear
associations, while dcor requires continuous data and performed poorly for discovery of complex
nonlinear associations. MIC was very time-consuming even for small datasets and often had a high
FDR. However, NMI efficiently found all types of associations (linear, parabolic, and sinusoidal)
and when applied using hierarchical testing improved both precision and recall. As a negative
control, applying HAllA to null datasets containing no associations resulted in the expected
baseline false discovery rate (a user-configurable parameter).
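
The contrast between rank correlation and NMI on a non-monotone relationship can be seen in a small, noise-free sketch. It does not use STRUDEL or HAllA; the helper names, the four-bin equal-frequency discretization, and the integer grid are assumptions chosen to keep the example deterministic.

```python
import math
from collections import Counter


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def discretize(values, n_bins=4):
    """Equal-frequency binning via ranks (bin labels 0 .. n_bins - 1)."""
    ranks = average_ranks(values)
    return [min(int((r - 1) * n_bins / len(values)), n_bins - 1) for r in ranks]


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(x, y):
    """Normalized mutual information with geometric-mean normalization."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    return (hx + hy - entropy(list(zip(x, y)))) / math.sqrt(hx * hy)


x = list(range(-20, 21))
linear = [2 * v + 1 for v in x]
parabolic = [v * v for v in x]

for name, y in [("linear", linear), ("parabolic", parabolic)]:
    print(name, round(spearman(x, y), 3), round(nmi(discretize(x), discretize(y)), 3))
```

On the symmetric parabola the rank correlation cancels to zero, while the NMI of the discretized features stays well above zero, which is the behavior the evaluation above attributes to NMI.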

We  applied  HAllA  to  three  real-world  datasets  containing  highly  diverse  data  types  from  several 
application  areas.  First,  we  assessed  data  published  by  Martin  et  al  (Hepatology  2007)  for  21 
liver  lipid  levels  and  120  genes'  hepatic  transcription  levels  in  40  wild-type  and  peroxisome 
proliferator-activated receptor-α (PPARα)-deficient mice. HAllA recovered all associations
previously  reported  by  Gonzalez  et  al  (J.  Stat.  Software  2008)  using  canonical  correlation 
analysis,  including  for  example  transcriptional  activation  of  xenobiotic  metabolism  genes 
Cyp3a11 and CAR1 in conjunction with high fatty acid levels. We further identified novel
associations between a cluster of transcripts including CAR1 and ACOTH (fatty acid transport
and trafficking) and a cluster of fatty acids including C18.0, C20.3n.6, and C20.4n.6. The
associations identified by HAllA were a strict superset of those previously reported,
demonstrating the utility of the method as a general-purpose tool that can find general patterns
with high sensitivity (Fig. 3).

Figure  3:  HAllA  identifies  validated  and 
novel associations between mouse liver
transcripts and fatty acid levels.

Associations  between  hepatic  fatty  acids  and 
gene  expression  in  data  from  Martin  et  al 
(Hepatology  2007).  indicates  significant 
associations  identified  by  HAllA  between 
corresponding  genes  and  fatty  acids,  a  strict 
superset  of  those  previously  reported. 

Next, HAllA expanded upon microbe-metabolite associations reported in the infant gut microbiome
by  Kostic  et  al  (CHM  2015).  This  test  of  the  hygiene  hypothesis  studied  a  prospective  cohort  of 
960  Finnish,  Russian,  and  Estonian  infants  at  risk  of  type  1  diabetes.  These  subjects  were 
followed  for  three  years,  while  monthly  stool  samples  and  a  variety  of  clinical  metadata  (e.g. 
breastfeeding,  diet,  allergies)  were  collected.  The  resulting  104  stool  samples  were  assessed  for 
microbial  community  composition  by  16S  rRNA  gene  sequencing  and  profiled  metabolomically. 
We  applied  HAllA  to  the  abundances  of  20  genera  and  284  metabolites  from  these  data,  again 
identifying a set of relationships that were previously detected (e.g. Veillonella and
sphingomyelins 22:0, 24:0, 24:1, C16:0, and C18:0; Ruminococcus and sphingomyelins 24:0
and  24:1).  We  further  recovered  an  association  between  Haemophilus  and  phenylalanine  in 
conjunction  with  a  novel  grouping  with  tryptophan  and  tyrosine.  Blautia  spp.  were  non-linearly 
associated  with  long-chain  triglycerides,  whereas  Ruminococcus  spp.  associated  with  short-chain 
triglycerides  and  Lactobacillus  spp.  with  both.  Finally,  Enterococcus  was  positively  associated 
with  diacyl  and  triacyl  glycerols,  in  agreement  with  independent  reports  of  Enterococcus  faecium 
bioactivity  in  improving  diabetic  lipid  (Roselino  Lipids  Health  2012)  and  triglyceride  (Cavallini 
Lipids  Health  2009)  levels. 


Finally, we applied HAllA to data collected from 204 ileal resection patients in which microbial
community profiles and host transcriptomes were assessed from 255 biopsies (Morgan CHM
2015).  Host  transcription  was  assayed  by  Affymetrix  microarray  and  microbiome  profiles  by  16S 
rRNA  gene  sequencing.  Previous  work  correlated  antibiotic  use,  inflammation,  biopsy  location, 
and  clinical  outcome  with  the  transcriptome  and  microbiome,  but  extensive  dimensionality 
reduction  was  required  to  preserve  power  while  determining  microbe-transcript  associations.  We 
applied  HAllA  to  a  highly-abundant  (above  50th  percentile)  and  variant  subset  (above  95th 
percentile),  comprising  108  microbes  and  498  transcripts.  We  found  576  associations  between 
features  and  feature  clusters;  34  included  at  least  one  of  the  top  gene  principal  component 
loadings  described  previously.  Of  these,  21  corresponded  to  the  genes  most  influential  in  the 
previously  reported  principal  component  9,  which  represented  only  1%  of  expression  variation 
but  was  correlated  with  the  most  microbes. 

Additionally,  we  found  that  the  abundance  of  Escherichia  was  correlated  with  host  expression  of 
MUC1 and nitric oxide synthase. Expression of the sodium/bile acid transporter SLC10A2, the
primary bile transporter in the distal ileum, was correlated with the genera Coprococcus, Blautia,
and Clostridium from the family Erysipelotrichaceae. Notably, antibiotic use was associated with
increased SLC10A2 expression and total bile acids, in agreement with Miyata et al (J. Pharma.
Exp. Therapy 2011), and both Crohn's disease and ulcerative colitis have previously been
associated with decreased metagenomic abundance of bile salts among the Lachnospiraceae and
Erysipelotrichaceae (Labbe PLoS ONE 2014).

The  manuscript  describing  HAllA  is  currently  in  review,  and  it  has  previously  been  presented  at 
the  2015  Broad  Institute  retreat,  the  2015  Statistical  and  Applied  Mathematical  Sciences 
Institute  workshop  on  Discovering  Patterns  in  Human  Microbiome  Data,  the  2014  Intelligent 
Systems for Molecular Biology (ISMB) conference, and the 2014 Dana-Farber Cancer Institute
Biostatistics  and  Computational  Biology  seminar  series.  Its  software  and  evaluation  package  are 
complete,  documented,  and  supported  by  online  tutorials  and  an  active  user  group  (see 
http://huttenhower.sph.harvard.edu/halla  and  http://bitbucket.org/biobakery/halla).  It  is  currently 
in  use  by  ongoing  collaborations  with  Dr.  Jessica  Green  at  the  University  of  Oregon,  Dr.  Frank 
Nestle  (King's  College  for  the  MAARS  study),  and  Dr.  Mihai  Netea  (Radboud  University 
Nijmegen). Postdoctoral fellow Dr. Gholamali Rahnavard is the current lead, in collaboration with
research associate Dr. Eric Franzosa and completing work previously executed by former
research assistant Yo Sup Moon and postdoctoral fellows Timothy Tickle (currently at the Broad
Institute) and Levi Waldron (currently at Hunter College).

Multiple  input/output  microbial  community  methods:  Multivariate  Analysis  with  Linear  models 

The Multivariate Analysis with Linear models (MaAsLin) method performs efficient high-dimensional
association testing specifically for microbial communities. Microbial taxonomic and
functional profiles possess unique statistical characteristics, in particular a combination of
sparsity (many zero values), compositionality (fixed per-sample sum) or count data, and
longitudinal variability (i.e. repeated measures), including typical epidemiological characteristics
of heterogeneous high-dimensional metadata. MaAsLin was developed as a well-powered
statistical  association  methodology  and  software  appropriate  for  microbial  data  in  conjunction 
with  any  experimental  design  and  associated  phenotypic,  clinical,  environmental,  or  'omic 
metadata. 


The  MaAsLin  algorithm  combines  four  main  steps  to  associate  relative  abundance  profiles  with 
heterogeneous metadata (Fig. 4). First, data are preprocessed to remove outliers, handle missing
values, and ensure quality control and consistency. Next, one of three link functions appropriate
for microbial profiles is applied: a variance-stabilizing arcsin square-root transformation by
default, or a log-linear Gaussian link, or a negative binomial count link. The first (arcsin-sqrt) is
an  approximation  that  is  less  theoretically  appropriate  than  the  latter  two,  but  performs 
comparably  well  in  evaluations  and  is  substantially  more  computationally  efficient  and 
numerically  stable.  Third,  dimensionality  is  reduced  if  necessary  using  a  feature  selection  step 
(boosting  by  default,  LASSO  and  univariate  selection  optionally  available).  Finally,  significant 
associations  are  identified  using  a  mixed  effects  linear  model  (zero-inflated  by  default). 
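
The default arcsin square-root transformation mentioned above is simple to state in code. The sketch below is illustrative only and is not taken from the MaAsLin source; the function names and the example counts are hypothetical.

```python
import math


def to_relative_abundance(counts):
    """Convert raw counts for one sample into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts]


def arcsin_sqrt(proportions):
    """Variance-stabilizing transform for proportions in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]


sample_counts = [50, 25, 25, 0]  # hypothetical taxon counts for one sample
transformed = arcsin_sqrt(to_relative_abundance(sample_counts))
print([round(v, 4) for v in transformed])
```

The transform maps proportions in [0, 1] onto [0, π/2] and is defined at zero, which matters for sparse profiles where a log transform would need a pseudocount.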

Figure 4: Multivariate linear models
(MaAsLin) for high-dimensional relative
abundance profile association with
multivariate metadata. Multivariate
Association by Linear models (MaAsLin) uses a
dimensionality-reduced mixed effects model to
identify significant covariation between
microbes or microbial functions and sample
metadata (phenotype, environment, etc.).


MaAsLin has been applied in a variety of published and ongoing studies during the total project
period:


•  Morgan  XC,  Kabakchiev  B,  Waldron  L,  Tyler  AD,  Tickle  TL,  Milgrom  R,  Stempak  JM, 
Gevers  D,  Xavier  RJ,  Silverberg  MS,  Huttenhower  C.  "Associations  between  host  gene 
expression,  the  mucosal  microbiome,  and  clinical  outcome  in  the  pelvic  pouch  of  patients 
with inflammatory bowel disease." Genome Biol. 2015 Apr 8;16(1):67

•  Kostic  AD,  Gevers  D,  Siljander  H,  Vatanen  T,  Hyotylainen  T,  Hamalainen  AM,  Peet  A, 
Tillmann  V,  Poho  P,  Mattila  I,  Lahdesmaki  H,  Franzosa  EA,  Vaarala  O,  de  Goffau  M, 
Harmsen  H,  Ilonen  J,  Virtanen  SM,  Clish  CB,  Oresic  M,  Huttenhower  C,  Knip  M; 
DIABIMMUNE  Study  Group,  Xavier  RJ.  "The  Dynamics  of  the  Human  Infant  Gut 
Microbiome  in  Development  and  in  Progression  toward  Type  1  Diabetes."  Cell  Host 
Microbe, 2015 Feb 11;17(2):260-73

•  Haberman Y, Tickle TL, Dexheimer PJ, Kim MO, Tang D, Karns R, Baldassano RN, Noe
JD, Rosh J, Markowitz J, Heyman MB, Griffiths AM, Crandall WV, Mack DR, Baker SS,
Huttenhower C, Keljo DJ, Hyams JS, Kugathasan S, Walters TD, Aronow B, Xavier RJ,
Gevers D, Denson LA. "Pediatric Crohn's disease patients exhibit specific ileal transcriptome
and microbiome signature." J Clin Invest, 2015 Mar 2;125(3):1363

•  Knights  D,  Silverberg  MS,  Weersma  RK,  Gevers  D,  Dijkstra  G,  Huang  H,  Tyler  AD,  van 
Sommeren S, Imhann F, Stempak JM, Huang H, Vangay P, Al-Ghalith GA, Russell C, Sauk
J, Knight J, Daly MJ, Huttenhower C, Xavier RJ. "Complex host genetics influence the
microbiome in inflammatory bowel disease." Genome Med, 2014 Dec 2;6(12):107

•  Gevers D, Kugathasan S, Denson LA, Vazquez-Baeza Y, Van Treuren W, Ren B, Schwager
E, Knights D, Song SJ, Yassour M, Morgan XC, Kostic AD, Luo C, Gonzalez A, McDonald
D, Haberman Y, Walters T, Baker S, Rosh J, Stephens M, Heyman M, Markowitz J,
Baldassano R, Griffiths A, Sylvester F, Mack D, Kim S, Crandall W, Hyams J, Huttenhower
C, Knight R, Xavier RJ. "The treatment-naive microbiome in new-onset Crohn's disease."
Cell Host and Microbe, 2014 Mar 12;15(3):382-92


•  Rooks  MG,  Veiga  P,  Wardwell-Scott  LH,  Tickle  T,  Segata  N,  Michaud  M,  Gallini  CA,  Beal 
C, van Hylckama-Vlieg JE, Ballal SA, Morgan XC, Glickman JN, Gevers D, Huttenhower C,
Garrett WS. "Gut microbiome composition and function in experimental colitis during active
disease and treatment-induced remission." ISME J, 2014 Feb 6

•  Smeekens SP, Huttenhower C, Riza A, van de Veerdonk FL, Zeeuwen PL, Schalkwijk J, van
der Meer JW, Xavier RJ, Netea MG, Gevers D. "Skin Microbiome Imbalance in Patients
with STAT1/STAT3 Defects Impairs Innate Host Defense Responses." J. Innate Imm. 2013

•  Huttenhower C, Gevers D, Knight R, The Human Microbiome Project Consortium, White O.
"Structure, function and diversity of the healthy human microbiome." Nature, 2012
486(7402):207-14

•  Methe BA, Nelson KE, Pop M, Creasy HH, Giglio MG, Huttenhower C, The Human
Microbiome Project Consortium, White O. "A framework for human microbiome research."
Nature, 2012 486(7402):215-21

•  Morgan XC*, Tickle TL*, Sokol H*, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah
SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower
C. "Dysfunction of the Intestinal Microbiome in Inflammatory Bowel Disease and
Treatment." Genome Biology, 2012, 13:R80

•  Tommi Vatanen, Aleksandar D. Kostic, Eva d'Hennezel, Heli Siljander, Eric A. Franzosa,
Moran Yassour, Raivo Kolde, Hera Vlamakis, Anu-Maaria Hamalainen, Aleksandr Peet,
Vallo Tillmann, Raivo Uibo, Sergei Mokurov, Natalya Dorshakova, Jorma Ilonen, Suvi M.
Virtanen, Susanne J. Szabo, Jeff Porter, Harri Lahdesmaki, Curtis Huttenhower, Dirk Gevers,
Thomas W. Cullen, Mikael Knip, on behalf of the DIABIMMUNE Study Group, and
Ramnik J. Xavier. "From Metagenomics to Mechanism: LPS Immunogenicity Links the
Infant Microbiome to Autoimmunity." in revision

•  Tiffany Hsu, Regina Joice, Jose Vallarino, Galeb Abu-Ali, Erica M. Hartmann, Afirah
Shafquat, Casey DuLong, Catherine Baranowski, Dirk Gevers, Jessica L. Green, Xochitl C.
Morgan, John D. Spengler, Curtis Huttenhower. "Urban transit system microbial
communities differ by surface type and interaction with humans and environment." in review

•  Daniela Bornigen, Boyu Ren, Robert Pickard, Jingfeng Li, Erica M. Hartmann, Weihong
Xiao, Timothy Tickle, Jennifer Rider, Dirk Gevers, Mary Ellen Davey, Curtis Huttenhower,
Maura Gillison. "Alterations in the oral microbiome associated with oral cancer risk factors
and their contributions to pathogenesis." in review

MaAsLin has also been presented in a wide range of venues, including:

•  "High-precision functional profiling of the gut microbiome for characterization during
inflammatory disease." University of Pittsburgh Immunology seminar. Pittsburgh, PA, 2015

•  "Microbial communities in the Boston MBTA mass transit system." MetaSUB First
International Summit on Metagenomics and Metadesign of Subways and Urban Biomes.
New York, NY, 2015

•  "Towards systems-level functional profiling of microbial communities and the human
microbiome." University of Pennsylvania Microbiology seminar. Philadelphia, PA, 2015

•  "High-precision Functional Profiling of Microbial Communities and the Human
Microbiome." 41st Annual Northeast Bioengineering Conference. Albany, NY, 2015

•  "High-precision functional profiling of microbial communities and the human microbiome."
Simons Foundation Symposium on Genomics in Single Cells and Microbiomes. New York,
NY, 2015


•  "A  Tour  of  the  bioBakery:  Computational  Tools  for  Microbial  Community  Analysis."  Broad 
Institute  Medical  and  Population  Genetics  seminar.  Cambridge,  MA,  2015  (presented  by  Eric 
Franzosa) 

•  "The  human  microbiome  and  biomarker  discovery."  Harvard  CATALYST  Understanding 
Biomarker  Science  workshop.  Boston,  MA,  2015 

•  "Towards  systems-level  functional  profiling  of  microbial  communities  and  the  human 
microbiome."  Channing  Division  of  Network  Medicine  Theodore  L.  Badger  Lecture.  Boston, 
MA,  2015 

•  "Characterizing  the  gut  microbial  ecosystem  for  diagnosis  and  therapy  in  inflammatory 
bowel  disease."  Keystone  Symposium  on  Gut  Microbiota  Modulation  of  Host  Physiology. 
Keystone,  CO,  2015 

•  "The  microbiome  in  IBD  and  analysis  methods  for  microbial  communities."  International 
Inflammatory  Bowel  Disease  Genetics  Consortium  meeting.  Barcelona,  Spain,  2015 

•  "An  Introduction  to  Microbial  Community  Analyses."  Evomics  and  Genomics  workshop. 
Cesky  Krumlov,  Czech  Republic,  2015 

•  "High-specificity  methods  for  profiling  microbial  communities  and  the  human  microbiome." 
University  of  Oregon  Computer  Science  colloquium.  Eugene,  OR,  2014 

•  "Metagenomics, metatranscriptomics, and multi'omic integration." Massachusetts General
Hospital  Center  for  the  Study  of  Inflammatory  Bowel  Disease  research  symposium.  Boston, 
MA,  2014 

•  "High-precision  methods  for  metagenomic  and  metatranscriptomic  profiling."  New  York 
University  Medical  School  seminar.  New  York,  NY,  2014 

•  "Gut  microbial  epidemiology  and  biogeography."  University  of  Washington  Genome 
Sciences  seminar.  Seattle,  WA,  2014 

•  "High-precision  profiling  of  microbial  communities  and  the  human  microbiome."  University 
of  Oregon  Institute  for  Theoretical  Sciences  seminar.  Eugene,  OR,  2014 

•  "Computational  Approaches  for  the  Human  Microbiome  in  Health  and  Disease,"  12th 
Biennial Congress of the Anaerobe Society of the Americas. Chicago, IL, 2014

•  "An  introduction  to  the  microbiome  and  quantitative  methods  for  microbial  community 
analysis,"  HSPH  Biostatistics  Summer  Program  in  Quantitative  Sciences.  Boston,  MA,  2014 

•  "An  introduction  to  the  microbiome  and  methods  for  microbial  community  analysis," 
Harvard/MIT  Minority  Introduction  to  Engineering,  and  Science.  Boston,  MA,  2014 

•  "An  introduction  to  metagenomics,"  Strategies  and  Techniques  for  Analyzing  Microbial 
Population  Structure.  Woods  Hole,  MA,  2014 

•  "Host-microbiome  transcriptional  crosstalk  and  clinical  outcome  in  a  large  ileal  pouch-anal 
anastomosis  (IPAA)  cohort,"  109th  International  Titisee  Conference.  Titisee,  Germany,  2014 

•  "Microbiome  Bioinformatics  Tools:  A  Tutorial,"  Keystone  Symposium  on  Exploiting  and 
Understanding  Chemical  Biotransformations  in  the  Human  Microbiome.  Big  Sky,  MT,  2014 
(presented  by  Xochitl  Morgan) 

•  "A  Tour  of  the  BioBakery:  Computational  Tools  for  Microbial  Community  Analysis," 
Harvard  School  of  Public  Health  Program  in  Quantitative  Genomics  Short  Course  series. 
Boston,  MA,  2014  (presented  by  Eric  Franzosa) 

•  "High-precision  functional  profiling  and  integration  of  metagenomes  and 
metatranscriptomes,"  Weizmann  Institute  Systems  Biology  Seminar  Series.  Rehovot,  Israel, 
2013 


•  "Bug  bytes:  bioinformatics  for  the  human  microbiome  in  health  and  disease,"  University  of 
Michigan  Molecular  and  Clinical  Epidemiology  of  Infectious  Diseases  (MAC-EPID) 
symposium.  Ann  Arbor,  MI,  2013 

•  "Computational  methods  for  meta'omic  characterization  of  the  human  microbiome,"  Tufts 
Computer  Science  Department  Colloquium  Series.  Medford,  MA,  2013 

•  "Interaction  of  Host  Gene  Expression  and  the  Human  Gut  Microbiome  in  Pouchitis," 
INFORMS  Annual  Meeting.  Minneapolis,  MN,  2013  (presented  by  Levi  Waldron) 

•  "Adding  depth  to  human  microbiome  studies  with  multi'omic  data  integration,"  International 
Human  Microbiome  Congress.  Hangzhou,  China,  2013 

•  "Bug  bytes:  bioinformatics  for  the  human  microbiome  in  health  and  disease,"  Canadian 
Student  Health  Research  Forum.  Alberta,  Canada,  2013 

•  "From  microbes  to  microbiota  and  back:  using  thousands  of  genomes  to  understand 
thousands  of  metagenomes,"  Harvard  School  of  Public  Health  Bioinformatics  Core  Forum. 
Boston,  MA,  2013 

•  "From  microbes  to  microbiota  and  back:  using  thousands  of  genomes  to  understand 
thousands  of  metagenomes,"  Symposium  and  Workshop  on  New  Methods  for 
Phylogenomics.  Austin,  TX,  2013 

•  "Computational  methods  for  meta'omic  characterization  of  the  human  microbiome,"  Los 
Alamos  National  Laboratory  Center  for  Nonlinear  Studies  seminar.  Los  Alamos,  NM,  2013 

•  "An  introduction  to  metagenomics,"  Strategies  and  Techniques  for  Analyzing  Microbial 
Population  Structure.  Woods  Hole,  MA,  2013 

•  "Bug  bytes:  Computational  analysis  methods  for  microbial  communities,"  University  of 
Oregon  BioBE  center  seminar.  Eugene,  OR,  2013 

•  "From  microbial  surveys  to  mechanisms  of  interaction  in  the  human  microbiome,"  University 
of  Colorado  at  Boulder  BioFrontiers  Institute  seminar.  Boulder,  CO,  2013 

•  "Detailing  the  human  microbiome  with  meta'omics,"  New  England  Primate  Research  Center. 
Southboro,  MA,  2012 

•  "Computational  methods  for  meta'omic  characterization  of  the  human  microbiome,"  Forsyth 
Institute  seminar.  Cambridge,  MA,  2012 

•  "Computational  methods  for  meta'omic  characterization  of  the  human  microbiome,"  Procter 
and  Gamble  BioFusion  Symposium.  Cincinnati,  OH,  2012 

•  "Bug  bytes:  Computational  analysis  methods  for  microbial  communities,"  Army  Research 
Office  workshop  on  the  skin  microbiome.  Boulder,  CO,  2012 

•  "Meta'omic  Characterization  of  Microbial  Community  Function  in  Health  and  Disease," 
Mount  Sinai  School  of  Medicine  Department  of  Health  Evidence  and  Policy  Grand  Rounds. 
New  York,  NY,  2012 

•  "Bug  bytes:  Computational  analysis  methods  for  microbial  communities,"  Carnegie  Mellon 
Lane  Center  for  Computational  Biology  seminar.  Pittsburgh,  PA,  2012 

•  "Meta'omic  characterization  of  microbial  community  function  in  health  and  disease," 
American  Society  for  Microbiology  Conference  on  Beneficial  Microbes.  San  Antonio,  TX, 
2012 

•  "Bug  bytes:  bioinformatics  for  metagenomics  and  microbial  community  analysis,"  Lewis-
Sigler  Institute  for  Integrative  Genomics  seminar.  Princeton,  NJ,  2012 

•  "Bug  bytes:  bioinformatics  for  metagenomics  and  microbial  community  analysis,"  8th 
International  Purdue  Symposium  on  Statistics.  West  Lafayette,  IN,  2012 


•  "Bug  bytes:  computational  methods  for  microbial  community  analysis,"  Woods  Hole  Marine 
Biology  Laboratory  seminar.  Woods  Hole,  MA,  2012 

•  "Computational  tools  for  functional  analysis  of  microbial  communities,"  Cloud  Computing 
for  the  Microbiome  Workshop.  Boulder,  CO,  2012 

The  software  is  currently  available  with  documentation  and  demonstration  data  at 
http://huttenhower.sph.harvard.edu/maaslin,  with  an  additional  Galaxy  interface  online  at 
http://huttenhower.sph.harvard.edu/galaxy.  The  project  is  currently  led  by  postdoctoral  fellow 
Dr.  Ayshwarya  Subramanian,  with  previous  contributions  by  former  postdoctoral  fellows  Dr. 
Soumya  Bannerjee  (currently  at  Children's  Hospital)  and  Dr.  Timothy  Tickle  (currently  at  the 
Broad  Institute),  Ph.D.  student  Emma  Schwager,  and  undergraduate  research  assistant  Yiren  Lu. 


